Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PENDING type to healthchecks #360

Merged
merged 2 commits into from
Oct 11, 2024
Merged

Add PENDING type to healthchecks #360

merged 2 commits into from
Oct 11, 2024

Conversation

andythsu
Copy link
Member

@andythsu andythsu commented May 24, 2024

Description

Resolves #222 part 1

Additional context and related issues

Release notes

(X) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

* 

@cla-bot cla-bot bot added the cla-signed label May 24, 2024
Copy link
Contributor

@rdsarvar rdsarvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small nitpicks, mostly around use of a 'healthy' variable that has > 2 states

@@ -46,7 +47,7 @@ static class LocalStats
{
private int runningQueryCount;
private int queuedQueryCount;
private boolean healthy;
private TrinoHealthStateType healthy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what are your thoughts on having this be heathState instead of healthy?

as well as respective places where healthy is used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. healthy kinda has a binary implication

@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier)
public void observe(List<ClusterStats> clustersStats)
{
for (ClusterStats clusterStats : clustersStats) {
if (!clusterStats.healthy()) {
if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: feels weird to read does healthy() = UNHEALTHY ?, would healthState() == UNHEALTHY read better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe just clusterStats.Health() ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, open to whichever naming - seeing the ‘y’ suffix (to me) on healthy implies boolean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like healthState() == UNHEALTHY

Copy link
Member

@Chaho12 Chaho12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about TrinoHealthStateType to TrinoHealthType ? i find it quite a redundant to say healthState as health is state of some status.

@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier)
public void observe(List<ClusterStats> clustersStats)
{
for (ClusterStats clusterStats : clustersStats) {
if (!clusterStats.healthy()) {
if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe just clusterStats.Health() ?

Copy link
Member

@Chaho12 Chaho12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve conversation after fixes has been made.
It makes it easier to PR (know that fix has been made for the commend)

LGTM 👍

@andythsu andythsu force-pushed the healthstate branch 2 times, most recently from f1a2261 to bca99f9 Compare May 31, 2024 17:27
Comment on lines 16 to 21
/**
* PENDING is for ui/observability purpose and functionally it's unhealthy
* We should use PENDING when Trino clusters are still spinning up
* HEALTHY is when health checks report clusters as up
* UNHEALTHY is when health checks report clusters as down
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be added to the docs. As to placement I think a section on health checks should be added, and linked to from the routing logic and operation sections. Wdyt @mosabua ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we want to put this in now that we have new doc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info

@andythsu
Copy link
Member Author

andythsu commented Jun 6, 2024

@willmostly

TestGatewayMultipleBackend uses the TrinoContainer from TestContainers for trino1 and trino2, which does not finish its startup() until SELECT 1 returns. So the health check should succeed. customBackend is a MockWebServer so you will need to add an endpoint to satisfy the healthcheck. I do not believe you should need to set health status manually through injection

Sorry for the confusion.

In TestGatewayMultipleBackend and TestGatewaySingleBackend, the clusters are added by calling the post api

The default healthstate when clusters are first added to the gateway is PENDING (and should be). Because PENDING is functionally treated as unhealthy, the test cases will fail (for example,

Request request =
new Request.Builder()
.url("http://localhost:" + routerPort + "/v1/statement")
.addHeader("X-Trino-User", "test")
.post(requestBody)
.build();
) since all clusters' states are still in PENDING when the test cases run. Unless we wait until the first round of healthcheck kicks in and changes the states from PENDING to HEALTHY, no clusters are available.

Copy link

cla-bot bot commented Sep 25, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Andy Su (Apps).
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Copy link
Member

@vishalya vishalya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to be merged. Does anyone have any further comments? We can wait for a day or two and merge it.

Copy link
Contributor

@rdsarvar rdsarvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one potential change to the new test modifications, otherwise lg2m as a first step

while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) {
// check the state of newly added cluster every second
if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: should we be adding some form of yield here so the while loop isn't burning through CPU cycles? maybe a 100ms sleep or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in Thread.sleep(100);? In that case do we still need this while loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 you should refactor more like:

        int timeout = 10 * 1000;
        while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
          // do whatever logic
          Thread.sleep(100);
        }

Copy link
Contributor

@rdsarvar rdsarvar Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'd still need the while loop right, as you want to check the cluster healthy every 1 second or so?

you could update this to be:

            if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
                Thread.sleep(System.currentTimeMillis() - lastExecutionTime)
            }

Or later on for the if (response.isSuccessful()) { call you could have an else Thread.sleep(1000) and remove this current if-condition

@@ -86,8 +95,13 @@ public Response updateEntity(
ProxyBackendConfiguration backend =
OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class);
gatewayBackendManager.updateBackend(backend);
log.info("Setting up the backend %s with healthy state", backend.getName());
routingManager.updateBackEndHealth(backend.getName(), backend.isActive());
log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on/off may be a little confusing, let's stick to the same terminology used in the code?

                    TrinoHealthStateType healthState = backend.isActive() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.UNHEALTHY;
                    log.info("Marking cluster '%s' with health state %s", backend.getName(), healthState);
                    routingManager.updateBackEndHealth(backend.getName(), healthState);
...

Copy link
Member Author

@andythsu andythsu Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PENDING is more like an internal state that's managed by healthcheck. In this case since it's turned on/off by the UI I think it's fine here. Also, if we use the healthState, it's going to say "Marking cluster cluster_a with health state PENDING", which is a bit weird IMO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @xkrogen .. the error message is is misleading .. the cluster is not shut down or anything .. its just flagged with a specific status from the point of view of the Trino Gateway

Copy link
Member Author

@andythsu andythsu Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about "Marking the cluster active/inactive"?

while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) {
// check the state of newly added cluster every second
if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 you should refactor more like:

        int timeout = 10 * 1000;
        while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
          // do whatever logic
          Thread.sleep(100);
        }

@andythsu andythsu force-pushed the healthstate branch 2 times, most recently from ce2d321 to 6bf8f6d Compare October 3, 2024 04:07
Copy link
Contributor

@rdsarvar rdsarvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg2m

Comment on lines 16 to 21
/**
* PENDING is for ui/observability purpose and functionally it's unhealthy
* We should use PENDING when Trino clusters are still spinning up
* HEALTHY is when health checks report clusters as up
* UNHEALTHY is when health checks report clusters as down
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info

@@ -86,8 +95,13 @@ public Response updateEntity(
ProxyBackendConfiguration backend =
OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class);
gatewayBackendManager.updateBackend(backend);
log.info("Setting up the backend %s with healthy state", backend.getName());
routingManager.updateBackEndHealth(backend.getName(), backend.isActive());
log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @xkrogen .. the error message is is misleading .. the cluster is not shut down or anything .. its just flagged with a specific status from the point of view of the Trino Gateway

monitorType: INFO_API

monitor:
taskDelaySeconds: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we look at using Airlift duration instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can I put this in a separate PR? This will require some code changes

Copy link
Member

@mosabua mosabua Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure .. ideally soon though so it can ship in the same release .. otherwise it will be a breaking change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config has been there for a while now so it will be a breaking change for this or next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay .. then it should definitely be a separate change. Please file an issue for that work so we dont forget.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #520

Comment on lines 19 to 20
* HEALTHY is when health checks report clusters as up
* UNHEALTHY is when health checks report clusters as down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing HEALTHY/UNHEALTHY to UP/DOWN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think UP/DOWN has the implication of server being up or down. HEALTHY could be a better candidate here because it means it passed the healthcheck. On the other hand, UNHEALTHY means it failed the healthcheck, but the trino cluster is still UP

Thread.sleep(1000);
}
}
assertThat(newClusterHealthState).isEqualTo(TrinoHealthStateType.HEALTHY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a utility class that manages helper methods. I don't understand why this check & assertion exist here. Could you extract into a dedicated test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow. This check and assert is added to make sure when setupBackend is called, the backends state are healthy before returning to the caller.

{
Request request = prepareGet()
.setUri(uriBuilderFrom(URI.create(baseUrl)).appendPath("/v1/info").build())
.build();
try {
ServerInfo serverInfo = client.execute(request, SERVER_INFO_JSON_RESPONSE_HANDLER);
return !serverInfo.isStarting();
return serverInfo.isStarting() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.HEALTHY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why "pending" is "ready" status. It worth leaving a code comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if trino is still starting, its status should be "pending", otherwise its status will be healthy

@andythsu andythsu force-pushed the healthstate branch 2 times, most recently from 2b76c56 to 1bb4748 Compare October 8, 2024 03:16
@ebyhr
Copy link
Member

ebyhr commented Oct 11, 2024

Please fix CI failures.

@ebyhr ebyhr merged commit 98c8e03 into trinodb:main Oct 11, 2024
3 checks passed
@github-actions github-actions bot added this to the 12 milestone Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Trino gateway health state design
8 participants